Update example-dvc-experiments with dvc exp init and confusion matrix #97
@iesahin could you please add some context to the description: why, previous discussions, etc.? That would help to understand this :)
I shouldn't add these just before the meeting :)
Thanks, Emre. It's still not enough context to meaningfully review this :( Why do we need this repo? What is the plan: replace the existing one, keep both, etc.? What was the motivation behind doing this?
Actually, this was a temporary PR. Your confusion is because of not marking this as a draft or WIP I think. Sorry. I'm testing how to generate a repository based on
Basically, what (the current) Another point is to make the experiment in a single stage if we use
Option (b) proved to be too slow: it takes at least 20-25 minutes to download. And I know (from our previous discussions) that you don't want to work with a single file as the dataset, as in option (a).
My motivation was testing |
How about a separate section that is focused more on
is it still the case? There were some improvements as far as I know ... could you point me to the dataset please to experiment a bit?
Realistically, it seems DVC doesn't handle 70K files at the moment ... at least for the quick start/get started project, where speed is important, should we consider using something smaller/artificial for now? cc @dberenbaum @efiop
What about starting by replacing the hidden
It's a lot of steps (basically what's there now plus No strong opinion on whether to keep this hidden or make a new section for it.
Let's check the times now, but IMO it's fine to use a subset of the data or a different data set if it still takes too long. Most users understand that tutorials use toy data to keep things moving. |
I've added some These are on WSL with a fairly good Windows laptop. I'll also test these on Google Cloud VM. You can test these yourselves by generating the repository with this branch: Some results:
The following are for
The following is for 4 experiments, set to run 2-by-2 in parallel. Note that plain
And finally, :)
The following are the
Please note that the difference between parallel Also, in this VM case, adding to the experiment queue takes around 3 minutes, vs 40 seconds on WSL. No other major processes were running during this test. BTW, a plain |
This is certainly possible, though I'm not sure if it's worth it. I was expecting Another problem is the performance of
This project is already a toy project: less than 40 MB of data in 70K small files. No serious user would have such a small project; our intended user base works with TB-scale data in millions of files. As a user, I'm frustrated by the slowness of DVC, and I'm trying to come up with solutions to overcome this for the example projects. I believe we have more serious issues than writing a good tutorial. Let me ask this straight: would you use DVC in a project with millions of files?
Sorry, I may have given the wrong impression. Those features would be nice, but the primary purpose is to help users get started with experiments. The hope is that As far as needing additional commands, auto
Maybe not -- I'm not really sure today. We are in the middle of changes to address these performance issues, especially for many files (not to mention we have an entirely new product being developed specifically to address this type of scenario). Please continue to comment in relevant issues in the core repo and open issues from your findings here. Maybe we can use these in dvc benchmarks. In the meantime, we still need to address docs needs. FWIW, my experience is that I have used it for data in the 100s of GBs and found it extremely useful. I think it can feel slow and better performance would have a major impact, but I want to clarify from both personal experience and community interactions that it is useful today in real-world applications. A few points might explain this:
While we wait for performance to improve, what other options do we have to move the docs forward?
Any other ideas? |
Dave, when we discussed this topic a few months ago, Ivan assured me that the core team has a plan regarding these issues. I'm in no position to decide whether that plan is feasible or not (and I certainly never intend to be a manager, criticize anyone, or push the team in a certain direction), but the current situation is not impressive, and I feel frustrated when I have to describe the features of a product that I cannot use pleasantly. Note that my concerns are those of a user, not of someone who's making decisions about the project. I used to use DVC to track my personal collections, but currently I don't. When I'm using our own product only for professional reasons, I believe that's a red flag. I can write tickets, but I don't think the gravity of the situation is well understood. Performance and security are two aspects that you cannot add to a software project later; they are not like features that you can add at a certain point. Every technical decision regarding features must also consider its effect on performance and security. Regarding the particular changes for the example project: I think we can keep the current docs and the project until the performance issues are resolved. I can convert the project to its original format, where the images are loaded from a single file, but I believe that's not what @shcheklein would want.
@iesahin I think Dave and the team very clearly understand the problem and are trying to address it as fast as possible. No one was saying that performance or security are not important. I see your frustration, but that's part of building things fast in an early-stage environment. We need to adapt quicker and find some workarounds faster. Let's try to discuss some options, please, and try to help the team as much as we can.
that's what we already do, but this would complicate the
probably won't work either - still too many files to do it quickly
what are the options here? NLP problems with DL on a text file? |
@iesahin Are you following https://github.com/iterative/dvc-bench? @efiop and others are already working on improvements there, but your input can be helpful.
@shcheklein What I mean here is that we have single stage, and inside |
@dberenbaum got it. @iesahin is it feasible? (I remember we had some code that was reading an archive on the fly) ... maybe we could even use something like HDF5?
Or TensorFlow datasets/formats that package data?
The earliest version of this project was using MNIST's custom image format to obtain the images on the fly. (It was in IDX format and generating numpy arrays from them.) We can revert to that if it sounds good. Another option is to use a single tar file that contains PNG images. Python supports tar in the standard library. We can also convert the project to a single-file NLP project, similar to example-get-started, but I don't think it's necessary, and either of the above two approaches will probably suffice.
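To make the tar option concrete, here is a minimal sketch of reading PNG members from a single archive with the standard library `tarfile` module. It is not the code from this PR: the helper names, the member paths, and the fake payloads are all hypothetical, and decoding the bytes into arrays (with an image library of your choice) is left out.

```python
import io
import tarfile

# First 8 bytes of every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def build_demo_tar(files):
    """Pack a {name: bytes} mapping into tar bytes (demo data only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def iter_png_members(tar_bytes):
    """Yield (name, raw_bytes) for each .png file without unpacking to disk."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as archive:
        for member in archive:
            if member.isfile() and member.name.endswith(".png"):
                yield member.name, archive.extractfile(member).read()


# Hypothetical layout: one directory per split and label.
demo = build_demo_tar({
    "train/0/00001.png": PNG_SIGNATURE + b"fake-pixels",
    "train/README.txt": b"not an image",
})
images = dict(iter_png_members(demo))
```

With this approach the dataset is tracked as one file, and the training code never has to materialize tens of thousands of small files in the workspace.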
Sounds good, Emre. Probably it's better to use tar or the TensorFlow format; the custom MNIST format is too specific, I guess.
Thinking about this in the initial version, I decided that "going with the default, as distributed from the dataset website" is more "excusable." Though it's easier to make a classifier that way, I don't like to use TensorFlow datasets, as we have the corresponding functionality in the dataset-registry. I'll use the tar version, if that's the deal? @shcheklein
This is ready for review and merge. @shcheklein @dberenbaum
    return (training_images, training_labels, testing_images, testing_labels)


def create_image_matrix(cells):
While this is a cool visual, it does add quite a bit to the example code. What about using some built-in visualization, like https://keras.io/api/utils/model_plotting_utils/#plot_model-function? It's probably much less useful, but it's certainly less example code. Not a strong opinion if others prefer this visual.
IMO, our model structure is a bit static and not-so-exciting. It's very simple to visualize TBH 😅
I can think of other, simpler ways. This came to my mind as a "confusion matrix" in image form. (I was thinking to put all classification errors in tiny little boxes on a large image, but one from each "confusion" looked better.)
The code is not that visible to the users. I don't think they'll peek inside the code that much. We'll just show the results and (maybe) link to the code on GitHub.
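For readers following the idea of "one from each confusion", a minimal sketch of the selection step could look like the following. This is not the `create_image_matrix` function from the PR; `first_confusions` and the sample labels are hypothetical, and laying the selected images out on a grid is omitted.

```python
def first_confusions(true_labels, predicted_labels):
    """Map each (true, predicted) pair with true != predicted to the index
    of the first sample that shows that confusion."""
    examples = {}
    pairs = zip(true_labels, predicted_labels)
    for index, (true, predicted) in enumerate(pairs):
        if true != predicted:
            # setdefault keeps the earliest index for a repeated confusion.
            examples.setdefault((true, predicted), index)
    return examples


# Made-up labels for illustration only.
true_labels = [3, 5, 3, 8, 5]
predicted_labels = [3, 3, 5, 8, 3]
grid = first_confusions(true_labels, predicted_labels)
```

Each key of `grid` would correspond to one cell of the image matrix, and the value picks which misclassified image to draw there.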
Moved this code to util.py. I think we can consider this resolved.
@@ -4,7 +4,7 @@ set -veux

HERE="$( cd "$(dirname "$0")" ; pwd -P )"
export HERE
-PROJECT_NAME="example-dvc-experiments"
+PROJECT_NAME="example-dvc-staging"
should we rename it back?
Which one should be permanent? I usually create and use the (private) example-dvc-staging more frequently than the (public) example-dvc-experiments, and pushing the example repository with the created script is very easy, so it may be used mistakenly to push versions with bugs, etc.
I don't have strong opinions here; the user must edit this line before generating the script.
Can we rename the repository itself after building and pushing? We can rename example-dvc-experiments to example-dvc-experiments-22-01-26 and archive it, then rename example-dvc-staging to example-dvc-experiments, and make it public. Is it too much work?
cp "${HERE}"/code/params.yaml .
## We are assuming the repo is generated in Linux
## Otherwise the following line must be changed to have requirements-macos.txt
can we detect this?
Yes, we can detect it. Would you like to support Windows as well?
Added support with uname -s for macOS only. If you'd like Windows support, let me know.
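The generation script itself is bash, but the same check can be sketched in Python, where `platform.system()` reports the same name as `uname -s` on macOS ("Darwin"). The helper name `requirements_file` is hypothetical, and this sketch assumes the macOS file replaces the default one rather than being installed in addition to it.

```python
import platform


def requirements_file(system=None):
    """Pick the requirements file for the given (or current) OS name."""
    # platform.system() returns "Darwin" on macOS and "Linux" on Linux.
    system = system or platform.system()
    if system == "Darwin":
        return "requirements-macos.txt"
    return "requirements.txt"


chosen = requirements_file()
```

Passing the system name explicitly, as in `requirements_file("Darwin")`, makes the branch easy to check on any machine.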
Looks almost good, some cleanup is required
cp "${HERE}"/code/params.yaml .
pip install -r "${REPO_PATH}"/requirements.txt
if [[ $(uname -s) == 'Darwin' ]] ; then
pip install -r "${REPO_PATH}"/requirements-macos.txt
It looks like macOS requires the whole generation to depend on conda on M1 Macs. This can be solved, but it requires a generate-macos.bash that will use conda instead of pip. I can work on this, WDYT? @shcheklein
Keeping this as is for the future; maybe we'll be able to install TF with pip on Macs someday.
Updates https://github.com/iterative/example-dvc-experiments with
dvc exp init instead of dvc add stage
plots/confusion.csv
plots/confusion.png
confusion.png is more clear that way.
Closes #96